A Hierarchic Architecture for Conceptual Information Retrieval

Authors

  • Shih-Hao Li
  • Peter B. Danzig
Abstract

Conceptual retrieval returns information related to a specific topic but not restricted to a query term. A common approach is to compare the query with all the documents in the database. When the number of documents is large, the searching time becomes significant. In this paper, we propose a hierarchic architecture which integrates latent semantic indexing (LSI) and hierarchic agglomerative clustering to reduce the searching time. We employ three clustering algorithms (single link, complete link, and group average) and conduct experiments on four standard document collections: CACM, CISI, CRAN, and MED. The experimental results show that our method requires less searching time while maintaining retrieval effectiveness comparable to that of non-clustered search.

Introduction

Searching information by conceptual meanings often suffers from the vocabulary problem, which states that users may fail to obtain desired information if the query terms used are different from those indexed by the retrieval system. To address the vocabulary problem, Deerwester et al. proposed Latent Semantic Indexing (LSI), where queries and documents are represented as vectors of conceptual meanings. LSI compares a query with all the documents in the database, then returns to the user those with the highest similarity to the query. It has been tested on several information systems with promising results.

A deficiency of LSI is that it searches through the whole database. The searching time becomes significant when the database is large. One way to ameliorate this problem is to search documents by clusters instead of record by record. This approach is based on van Rijsbergen's cluster hypothesis, which states that "closely associated documents tend to be relevant to the same queries." He also proposed cluster-based retrieval on hierarchically clustered collections to improve retrieval effectiveness and efficiency.

In this paper, we describe a hierarchic architecture that applies hierarchic clustering to LSI documents, and we compare its performance with non-clustered LSI retrieval on several document collections. The following sections review the methodology of LSI and hierarchic clustering, describe our architecture, show the experimental results, and present the conclusions.

Background

Latent Semantic Indexing

LSI is an extension of Salton's Vector Space Model, in which documents and queries are represented as vectors of term frequencies or weights. To capture the semantic structure among documents in a database, LSI applies Singular Value Decomposition (SVD) to a term-document matrix representing the database and generates vectors of k orthogonal indexing dimensions, where each dimension represents a linearly independent concept. The k-dimensional vectors are used to represent both documents and terms in the same semantic space, while their values indicate the degrees of association with the k underlying concepts. An accompanying figure shows SVD applied to a term-document matrix.

[Figure: SVD applied to an m × n term-document matrix, producing an m × k term matrix and an n × k document matrix, where m and n are the numbers of terms and documents in the database and k is the indexing dimension used by SVD.]

A query vector in LSI is the weighted sum of its component term vectors. For example, a p-term query is represented as the average of the p decomposed term vectors. To determine relevant documents, the query vector is compared with all the document vectors, and those with the highest cosine coefficient are returned. Because the indexing dimension k is chosen to be much smaller than the number of terms and documents in the database (i.e., the number of rows and columns in the term-document matrix), those k concepts are neither term nor document frequencies but compressed forms of both. Therefore, a query can hit documents that share no terms with it but do share concepts.

Hierarchic Agglomerative Clustering

To cluster LSI documents, we apply the hierarchic agglomerative clustering method. Hierarchic agglomerative clustering has been studied as a way to increase retrieval effectiveness and efficiency compared to the conventional search of non-clustered data. A typical hierarchic clustering method can be described as follows:

  1. Compute all pair-wise inter-document similarity coefficients.
  2. Place each document in a cluster of its own.
  3. Form a new cluster by merging the most similar pair of current clusters.
  4. Recompute the similarity between the newly merged cluster and the remaining clusters.
  5. Repeat from step 3 while there is more than one cluster.

The output of a hierarchic clustering algorithm is a cluster hierarchy, as shown in an accompanying figure. Clustering methods differ in the manner in which they define the similarity between clusters. Three of the most commonly used methods in information retrieval are single link, complete link, and group average. Single-link clustering uses the similarity between the most similar pair of documents, one in each cluster, as the similarity between the two clusters. Complete-link clustering uses that of the least similar document pair in the two clusters. Group-average clustering uses the average similarity of all document pairs as the inter-cluster similarity. Early experiments showed that the performance of hierarchic clustering methods varies when tested on different document collections.
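
The clustering steps and the three inter-cluster similarity definitions described above can be sketched in a few lines of Python. This is a minimal, deliberately naive illustration and not the authors' implementation: the function names (cosine, cluster_similarity, agglomerate) and the toy two-dimensional vectors are assumptions made here, and the cosine coefficient is used as the inter-document similarity only because the text uses it for query matching.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine coefficient between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_similarity(c1, c2, sim, method):
    """Inter-cluster similarity under one of the three linkage rules."""
    pair_sims = [sim[i][j] for i in c1 for j in c2]
    if method == "single":      # most similar document pair
        return max(pair_sims)
    if method == "complete":    # least similar document pair
        return min(pair_sims)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_sims) / len(pair_sims)
    raise ValueError(f"unknown method: {method}")


def agglomerate(vectors, method="single"):
    """Return the merge history as (cluster_a, cluster_b, similarity)."""
    n = len(vectors)
    # Compute all pairwise inter-document similarities.
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Place each document (identified by its index) in its own cluster.
    clusters = [(i,) for i in range(n)]
    merges = []
    # Keep merging while more than one cluster remains.
    while len(clusters) > 1:
        # Cluster similarities are recomputed from the pairwise table on
        # every pass (hypothetical simplification; no caching), and the
        # most similar pair of current clusters is selected for merging.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], sim, method),
        )
        merges.append((a, b, cluster_similarity(a, b, sim, method)))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


# Toy document vectors (made up for illustration): two tight pairs.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
for method in ("single", "complete", "average"):
    print(method, agglomerate(docs, method))
```

The merge history is the cluster hierarchy read bottom-up. On the toy data all three methods merge the same pairs first and differ only in the similarity they assign to the final merge, which is the distinction the definitions above draw. Recomputing every cluster pair on each pass keeps the sketch short at the cost of speed; a practical implementation would update only the similarities involving the newly merged cluster.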

Related resources

The effectiveness of query-specific hierarchic clustering in information retrieval

Hierarchic document clustering has been widely applied to Information Retrieval (IR) on the grounds of its potential improved effectiveness over inverted file search. However, previous research has been inconclusive as to whether clustering does bring improvements. In this paper we take the view that if hierarchic clustering is applied to search results (query-specific clustering), then it has ...

Emotions and information seeking: how does emotion manifest in information seeking behavior?

Background and Aim: Information seeking behavior arises when one feels a void in one's knowledge, which inspires one to acquire new information. The central point in explaining behavior is the fact that many features influence its occurrence, and emotions are considered to be a major element involved in human information behavior. Also, information seeking is a positive and negative emotional...

The effectiveness of query-based hierarchic clustering of documents for information retrieval

Conceptual Relationships and Diffusion Model of Information, Misinformation and Disinformation

Background and Aim: Proper management of the information process requires considering various definitions and combinations of the term "information". The purpose of this study was to clarify the concepts of information, misinformation and disinformation, and to better understand the ways of sharing, differentiation and relationships between them, and to explain the patterns and motivations for ...

Automatic Workflow Generation and Modification by Enterprise Ontologies and Documents

This article presents a novel method and development paradigm that proposes a general template for an enterprise information structure and allows for the automatic generation and modification of enterprise workflows. This dynamically integrated workflow development approach utilises a conceptual ontology of domain processes and tasks, enterprise charts, and enterprise entities. It also suggests...

Publication date: 1996